Abstract

Information distances like the Hellinger distance and the Jensen-Shannon divergence
have deep roots in information theory and machine learning. They are used extensively
in data analysis, especially when the objects being compared are high-dimensional
empirical probability distributions built from data. However, we lack the common tools
needed to use information distances in applications efficiently and at scale with any kind
of provable guarantees. We cannot sketch these distances easily, embed them in
better-behaved spaces, or even reduce the dimensionality of the space while maintaining the
probability structure of the data.

In this paper, we build these tools for information distances: for the Hellinger
distance and the Jensen-Shannon divergence, as well as related measures such as the $\chi^2$ divergence.

We first show that they can be sketched efficiently (i.e., up to multiplicative error in
sublinear space) in the aggregate streaming model. This result is exponentially stronger
than known upper bounds for sketching these distances in the strict turnstile streaming
model. Second, we show a finite-dimensional embedding result for the Jensen-Shannon
and $\chi^2$ divergences that preserves pairwise distances. Finally, we prove a dimensionality
reduction result for the Hellinger, Jensen-Shannon, and $\chi^2$ divergences that preserves the
information geometry of the distributions (specifically, by retaining the simplex structure
of the space). While our second result already implies that these divergences can
be explicitly embedded in Euclidean space, retaining the simplex structure is important
because it allows us to continue doing inference in the reduced space. In essence, we
preserve not just the distance structure but the underlying geometry of the space.

* This research was funded in part by the NSF under grants CCF-0953066, CCF-0953754, CCF-1320719 
and BIGDATA-1251049 and by a Google Faculty Research Award. 
1. Introduction

The space of information distances includes many distances that are used extensively in data
analysis. These include the well-known Bregman divergences, the $\alpha$-divergences, and the
$f$-divergences. In this work we focus on a subclass of the $f$-divergences that admit embeddings
into some (possibly infinite-dimensional) Hilbert space, with a specific emphasis on the JS
divergence. These divergences are used in statistical tests and estimators (Beran, 1977), as
well as in image analysis (Peter and Rangarajan, 2008), computer vision (Huang et al., 2005;
Mahmoudi and Sapiro, 2009), and text analysis (Dhillon et al., 2003; Eiron and McCurley,
2003). They were introduced by Csiszar (1963), and, in the most general case, also include
measures such as the Hellinger, JS, and $\chi^2$ divergences (here we consider a symmetrized
variant of the $\chi^2$ distance).

To work with the geometry of these divergences effectively at scale and in high
dimensions, we need algorithmic tools that can provide provably high-quality approximate
representations of the geometry. The techniques of sketching, embedding, and dimensionality
reduction have evolved as ways of dealing with this problem.

A sketch for a set of points with respect to a property P is a function that maps
the data to a small summary from which property P can be evaluated, albeit with some
approximation error. Linear sketches are especially useful for estimating a derived property
of a data stream in a fast and compact way.¹ Complementing sketching, embedding
techniques are one-to-one mappings that transform a collection of points lying in one
space X to another (presumably easier) space Y, while approximately preserving distances
between points. Dimensionality reduction is a special kind of embedding which preserves
the structure of the space while reducing its dimension. These embedding techniques
can be used in an almost "plug-and-play" fashion to speed up many algorithms in data
analysis, for example near neighbor search (and classification), clustering, and closest
pair calculations.

Unfortunately, while these tools have been well developed for norms like $\ell_1$ and $\ell_2$, we
lack such tools for information distances. This is not just a theoretical concern: information
distances are semantically more suited to many tasks in machine learning, and building the
appropriate algorithmic toolkit to manipulate them efficiently would greatly expand the
places where they can be used.


1.1. Our contributions 


Sketching information divergences. Guha, Indyk, and McGregor (2007) proved an
impossibility result, showing that a large class of information divergences cannot be sketched
in sublinear space, even if we allow for constant factor approximations. This result holds in
the strict turnstile streaming model, a model in which coordinates of two points $x, y \in \Delta_d$
are increased incrementally and we wish to maintain an estimate of the divergence between
them. They left open the question of whether these divergences can be sketched in the
aggregate streaming model, where each element of the stream gives the $i$th coordinate of $x$
or $y$ in its entirety, but the coordinates may appear in an arbitrary order. We answer this in
the affirmative for two important information distances, namely, the Jensen-Shannon and
$\chi^2$ divergences.

1. Indeed, Li, Nguyen, and Woodruff (2014) show that any optimal one-pass streaming sketch algorithm in
the turnstile model can be reduced to a linear sketch with logarithmic space overhead.

Theorem 1 A set of points $P$ under the Jensen-Shannon (JS) or $\chi^2$ divergence can be
deterministically embedded into $O\left(\frac{d^2}{\epsilon}\log\frac{d}{\epsilon}\right)$ dimensions under $\ell_2^2$ with $\epsilon$ additive error. The
same space bound holds when sketching JS or $\chi^2$ in the aggregate stream model.

Corollary 2 Assuming polynomial precision, an AMS sketch for Euclidean distance can
reduce the dimension to $O\left(\frac{1}{\epsilon^2}\log\frac{1}{\epsilon}\log d\right)$ for a $(1+\epsilon)$ multiplicative approximation in the
aggregate stream setting.

Theorem 3 A set $P$ of $n$ points under the JS or $\chi^2$ divergence can be embedded into $\ell_2^2$ of
dimension $\bar{d} = O\left(\frac{n^2 d^3}{\epsilon^2}\right)$ with $(1+\epsilon)$ multiplicative error.

For both techniques, applying the Euclidean JL-Lemma can further reduce the
dimension to $O\left(\frac{\log |P|}{\epsilon^2}\right)$ in the offline setting.

Dimensionality reduction. We then turn to the more challenging case of performing 
dimensionality reduction for information distances, where we wish to preserve not only 
the distances between pairs of points (distributions), but also the underlying simplicial 
structure of the space, so that we can continue to interpret coordinates in the new space 
as probabilities. This notion of a structure-preserving dimensionality reduction is implicit 
when dealing with normed spaces (since we always map a normed space to another), but 
requires an explicit mapping when dealing with more structured spaces. We prove an analog 
of the classical JL-Lemma:

Theorem 4 For the Jensen-Shannon, Hellinger, and $\chi^2$ divergences, there exists a
structure-preserving dimensionality reduction from the high-dimensional simplex $\Delta_d$ to a
low-dimensional simplex $\Delta_k$, where $k = O((\log n)/\epsilon^2)$.

The theorem extends to "well-behaved" $f$-divergences (see Section 3 for a precise
definition). Moreover, the dimensionality reduction is constructive for any divergence with a
finite-dimensional kernel (such as the Hellinger divergence), or an infinite-dimensional kernel that
can be sketched in finite space, as we show is feasible for the JS and $\chi^2$ divergences.

Our techniques. The unifying approach of our three results (sketching, embedding into
$\ell_2^2$, and dimensionality reduction) is to analyze carefully the infinite-dimensional kernel
of the information divergences. Quantizing and truncating the kernel yields the sketching
result, and sampling repeatedly from it produces an embedding into $\ell_2^2$. Finally, given such an
embedding, we show how to perform dimensionality reduction by proving that each of the
divergences admits a region of the simplex where it is similar to $\ell_2^2$. We point out that, to the
best of our knowledge, this is the first result that explicitly uses the kernel representation of
these information distances to build approximate geometric structures; while the existence
of a kernel for the Jensen-Shannon distance was well known, this structure had never been
exploited for algorithmic advantage.
2. Related Work

The works by Fuglede and Topsøe (2004), and then by Vedaldi and Zisserman (2012), study
embeddings of information divergences into an infinite-dimensional Hilbert space by
representing them as an integral along a one-dimensional curve in $\mathbb{C}$. Vedaldi and Zisserman give
an explicit formulation of this kernel for JS and $\chi^2$ divergences, for which a discretization
(by quantizing and truncating) yields an additive error embedding into a finite-dimensional
$\ell_2^2$. However, they do not obtain quantitative bounds on the dimension of the target space
needed or address the question of multiplicative approximation guarantees.

In the realm of sketches, Guha, Indyk, and McGregor (2007) show $\Omega(n)$ space (where
$n$ is the length of the stream) is required in the strict turnstile model even for a constant
factor multiplicative approximation. These bounds hold for a wide range of information
divergences, including JS, Hellinger and the $\chi^2$ divergences. They show however that an
additive error of $\epsilon$ can be achieved using $O\left(\frac{1}{\epsilon^2}\log n\right)$ space. In contrast, one can indeed
achieve a multiplicative approximation in the aggregate streaming model for information
divergences that have a finite-dimensional embedding into $\ell_2^2$. For instance, Guha et al.
(2006) observe that for the Hellinger distance, which has a trivial such embedding, sketching
is equivalent to sketching $\ell_2^2$ and hence may be done up to a $(1+\epsilon)$-multiplicative
approximation in $\frac{1}{\epsilon^2}\log n$ space. This immediately implies a constant factor approximation of JS
and $\chi^2$ divergences in the same space, but no bounds have been known prior to our work
for a $(1+\epsilon)$-sketching result for JS and $\chi^2$ divergences in any streaming model.

Moving on to dimensionality reduction from simplex to simplex, in the only other work we
are aware of, Kyng, Phillips, and Venkatasubramanian (2010) show a limited dimensionality
reduction result for the Hellinger distance. Their approach works by showing that if the
input points lie in a specific region of the simplex, then a standard random projection will
keep the points on a lower-dimensional simplex while preserving the distances approximately.
Unfortunately, this region is a small ball centered in the interior of the simplex, which further
shrinks with the dimension. This is in sharp contrast to our work here, where the input
points are unconstrained.

While it does not admit a kernel, the $\ell_1$ distance is also an $f$-divergence, and it is
therefore natural to investigate its potential connection with the measures we study here. For
$\ell_1$, it is well known that significant dimensionality reduction is not possible: an embedding
with distortion $1+\epsilon$ requires the points to be embedded in $n^{1-O(1/\log(1/\epsilon))}$ dimensions, which is
nearly linear. This result was proved (and strengthened) in a series of results (Andoni et al.,
2011; Regev, 2012; Lee and Naor, 2004; Brinkman and Charikar, 2005).


The general literature of sketching and embeddability in normed spaces is too extensive
to be reviewed here: we point the reader to Andoni et al. (2014) for a full discussion of
results in this area. One of the most famous applications of dimension reduction is the
Johnson-Lindenstrauss (JL) Lemma, which states that any set of $n$ points in $\ell_2$ can be
embedded into $O\left(\frac{\log n}{\epsilon^2}\right)$ dimensions in the same space while preserving pairwise distances
to within a $(1 \pm \epsilon)$ factor.

Although sketching, embeddability, and dimensionality reduction are related operations,
they are not always equivalent. For example, even though $\ell_1$ and $\ell_2$ have very different
behavior under dimensionality reduction, they can both be sketched to an arbitrary error
in the turnstile model (and in fact any $\ell_p$ norm, $p \le 2$, can be sketched using $p$-stable
distributions (Indyk, 2000)). In the offline setting, Andoni et al. (2014) show that sketching
and embedding of normed spaces are equivalent: for any finite-dimensional normed space
$X$, a constant distortion and space sketching algorithm for $X$ exists if and only if there
exists a linear embedding of $X$ into $\ell_{1-\epsilon}$.


3. Background 

In this section, we define precisely the class of information divergences that we work with,
and their specific properties that allow us to obtain sketching, embedding, and
dimensionality results. For what follows, $\Delta_d$ denotes the $d$-simplex: $\Delta_d = \{(x_1, \ldots, x_d) \mid \sum_i x_i = 1$
and $x_i \ge 0, \forall i\}$. Let $[d] = \{1, \ldots, d\}$.

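Definition 5 ($f$-divergence) Let $p$ and $q$ be two distributions on $[n]$. A convex function
$f : [0, \infty) \to \mathbb{R}$ such that $f(1) = 0$ gives rise to an $f$-divergence $D_f : \Delta_n \times \Delta_n \to \mathbb{R}$ as:

\[ D_f(p, q) = \sum_{i=1}^{n} p_i \cdot f\left(\frac{q_i}{p_i}\right), \]

where we define $0 \cdot f(0/0) = 0$, $a \cdot f(0/a) = a \cdot \lim_{u \to 0} f(u)$, and $0 \cdot f(a/0) = a \cdot \lim_{u \to \infty} f(u)/u$.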


Definition 6 (Regular distance) We call a distance function $D : X \times X \to \mathbb{R}$ regular if there
exists a feature map $\phi : X \to V$, where $V$ is a (possibly infinite-dimensional) Hilbert space,
such that:

\[ D(x, y) = \|\phi(x) - \phi(y)\|^2 \quad \forall x, y \in X. \]


The work of Fuglede and Topsøe (2004) establishes that JS is regular; Vedaldi and Zisserman
(2012) construct an explicit feature map for the JS kernel as $\phi(x) = \int_{-\infty}^{+\infty} \Psi_x(\omega)\,d\omega$, where
$\Psi_x(\omega) : \mathbb{R} \to \mathbb{C}$ is given by

\[ \Psi_x(\omega) = \exp(i\omega \ln x)\sqrt{\frac{2x\,\mathrm{sech}(\pi\omega)}{(\ln 4)(1 + 4\omega^2)}}. \]

Hence we have for $x, y \in \mathbb{R}$, $\mathrm{JS}(x, y) = \|\phi(x) - \phi(y)\|^2 = \int_{-\infty}^{+\infty} \|\Psi_x(\omega) - \Psi_y(\omega)\|^2\,d\omega$.

The "embedding" for a given distribution $p \in \Delta_d$ is then the concatenation of the functions
$\phi(p_i)$, i.e., $\phi(p) = (\phi(p_1), \ldots, \phi(p_d))$.

Definition 7 (Well-behaved divergence) A well-behaved $f$-divergence is a regular
$f$-divergence such that $f(1) = 0$, $f'(1) = 0$, $f''(1) > 0$, and $f'''(1)$ exists.

In this paper, we will focus on the following well-behaved $f$-divergences.
Definition 8 The Jensen-Shannon (JS), Hellinger, and $\chi^2$ divergences between
distributions $p$ and $q$ are defined as:

\[ \mathrm{JS}(p, q) = \sum_i \left( p_i \log\frac{2p_i}{p_i + q_i} + q_i \log\frac{2q_i}{p_i + q_i} \right), \]
\[ \mathrm{He}(p, q) = \sum_i \left(\sqrt{p_i} - \sqrt{q_i}\right)^2, \]
\[ \chi^2(p, q) = \sum_i \frac{(p_i - q_i)^2}{p_i + q_i}. \]

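4. Embedding JS into $\ell_2^2$

We present two algorithms for embedding JS into $\ell_2^2$. The first is deterministic and gives an
additive error approximation, whereas the second is randomized but yields a multiplicative
approximation in an offline setting. The advantage of the first algorithm is that it can
be realized in the streaming model, and if we make a standard assumption of polynomial
precision in the streaming input, yields a $(1+\epsilon)$-multiplicative approximation as well in
this setting.

We derive some terms in the kernel representation of $\mathrm{JS}(x, y)$ which we will find
convenient. First, the explicit formulation in Section 3 yields that for $x, y \in \mathbb{R}$:

\[ \mathrm{JS}(x, y) = \int_{-\infty}^{+\infty} \left\| e^{i\omega \ln x}\sqrt{\frac{2x\,\mathrm{sech}(\pi\omega)}{(\ln 4)(1 + 4\omega^2)}} - e^{i\omega \ln y}\sqrt{\frac{2y\,\mathrm{sech}(\pi\omega)}{(\ln 4)(1 + 4\omega^2)}} \right\|^2 d\omega \]
\[ \phantom{\mathrm{JS}(x, y)} = \int_{-\infty}^{+\infty} \left\| \sqrt{x}\,e^{i\omega \ln x} - \sqrt{y}\,e^{i\omega \ln y} \right\|^2 \frac{2\,\mathrm{sech}(\pi\omega)}{(\ln 4)(1 + 4\omega^2)}\,d\omega. \]

For convenience, we now define:

\[ h(x, y, \omega) = \left\| \sqrt{x}\,e^{i\omega \ln x} - \sqrt{y}\,e^{i\omega \ln y} \right\|^2 = \left(\sqrt{x}\cos(\omega \ln x) - \sqrt{y}\cos(\omega \ln y)\right)^2 + \left(\sqrt{x}\sin(\omega \ln x) - \sqrt{y}\sin(\omega \ln y)\right)^2, \]

and

\[ \kappa(\omega) = \frac{2\,\mathrm{sech}(\pi\omega)}{(\ln 4)(1 + 4\omega^2)}. \]

We can then write $\mathrm{JS}(p, q) = \sum_{i=1}^{d} f_J(p_i, q_i)$ where

\[ f_J(x, y) = \int_{-\infty}^{\infty} h(x, y, \omega)\kappa(\omega)\,d\omega = x \log\left(\frac{2x}{x + y}\right) + y \log\left(\frac{2y}{x + y}\right). \]

It is easy to verify that $\kappa(\omega)$ is a distribution, i.e., $\int_{-\infty}^{\infty} \kappa(\omega)\,d\omega = 1$.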
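The kernel identities for $\kappa$ and $f_J$ can be checked numerically. The sketch below is illustrative only: it assumes the logarithm in $f_J$ is base 2 (consistent with the $\ln 4$ normalization), uses a plain midpoint rule on $[-12, 12]$ as a stand-in for the improper integral, and all function names are hypothetical.

```python
import math

LN4 = math.log(4.0)

def kappa(w):
    """JS kernel density: 2 sech(pi w) / ((ln 4)(1 + 4 w^2))."""
    return 2.0 / (math.cosh(math.pi * w) * LN4 * (1.0 + 4.0 * w * w))

def h(x, y, w):
    """|sqrt(x) e^{i w ln x} - sqrt(y) e^{i w ln y}|^2 for x, y > 0."""
    ax, ay = w * math.log(x), w * math.log(y)
    re = math.sqrt(x) * math.cos(ax) - math.sqrt(y) * math.cos(ay)
    im = math.sqrt(x) * math.sin(ax) - math.sqrt(y) * math.sin(ay)
    return re * re + im * im

def f_j(x, y):
    """Closed form x log(2x/(x+y)) + y log(2y/(x+y)), with log base 2 (assumption)."""
    return x * math.log2(2.0 * x / (x + y)) + y * math.log2(2.0 * y / (x + y))

def midpoint(g, lo, hi, n):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(g(lo + (k + 0.5) * step) for k in range(n))

# The tails beyond |w| = 12 are negligible because sech(pi w) decays like e^{-pi |w|}.
mass = midpoint(kappa, -12.0, 12.0, 24000)
integral = midpoint(lambda w: h(0.2, 0.7, w) * kappa(w), -12.0, 12.0, 24000)
closed_form = f_j(0.2, 0.7)
```

Here `mass` approximates the total mass of $\kappa$, and `integral` approximates $\int h\,\kappa$ for the single pair $(x, y) = (0.2, 0.7)$, to be compared with `closed_form`.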


4.1. Deterministic embedding 

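We will produce an embedding $\phi(p) = (\phi_{p_1}, \ldots, \phi_{p_d})$, where each $\phi_{p_i}$ is an integral that we
can discretize by quantizing and truncating carefully.

To analyze Algorithm 1, we first obtain bounds on the function $h$ and its derivative.

Algorithm 1: Embed $p \in \Delta_d$ under JS into $\ell_2^2$.
Input: $p = \{p_1, \ldots, p_d\}$ where coordinates are ordered by arrival.
Output: A vector $c^p$ of length $O\left(\frac{d^2}{\epsilon}\log\frac{d}{\epsilon}\right)$
$\ell \leftarrow 1$; $J \leftarrow \lceil \frac{32d}{\epsilon}\ln\left(\frac{8d}{\epsilon}\right) \rceil$;
for $j \leftarrow -J$ to $J$ do
    $\omega_j \leftarrow j \times \epsilon/32d$
end
for $i \leftarrow 1$ to $d$ do
    for $j \leftarrow -J$ to $J - 1$ do
        $a^p_\ell \leftarrow \sqrt{p_i}\cos(\omega_j \ln p_i)\sqrt{\int_{\omega_j}^{\omega_{j+1}} \kappa(\omega)\,d\omega}$
        $b^p_\ell \leftarrow \sqrt{p_i}\sin(\omega_j \ln p_i)\sqrt{\int_{\omega_j}^{\omega_{j+1}} \kappa(\omega)\,d\omega}$
        $\ell \leftarrow \ell + 1$
    end
end
return $a^p$ concatenated with $b^p$.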


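Lemma 9 For $0 \le x, y \le 1$, we have $0 \le h(x, y, \omega) \le 2$ and

\[ \left| \frac{\partial h(x, y, \omega)}{\partial \omega} \right| \le 16. \]

Proof Clearly $h(x, y, \omega) \ge 0$. Furthermore, since $0 \le x, y \le 1$, we have

\[ h(x, y, \omega) \le \left| \sqrt{x}\,e^{i\omega \ln x} \right|^2 + \left| \sqrt{y}\,e^{i\omega \ln y} \right|^2 = x + y \le 2. \]

Next,

\[ \left| \frac{\partial h(x, y, \omega)}{\partial \omega} \right| = \Big| 2\left(\sqrt{x}\cos(\omega \ln x) - \sqrt{y}\cos(\omega \ln y)\right)\left(-\sqrt{x}\sin(\omega \ln x)\ln x + \sqrt{y}\sin(\omega \ln y)\ln y\right) \]
\[ \qquad + 2\left(\sqrt{x}\sin(\omega \ln x) - \sqrt{y}\sin(\omega \ln y)\right)\left(\sqrt{x}\cos(\omega \ln x)\ln x - \sqrt{y}\cos(\omega \ln y)\ln y\right) \Big| \]
\[ \le \left| 2\left(\sqrt{x} + \sqrt{y}\right)\left(\sqrt{x}\ln x + \sqrt{y}\ln y\right) \right| + 2\left| \left(\sqrt{x} + \sqrt{y}\right)\left(\sqrt{x}\ln x + \sqrt{y}\ln y\right) \right| \le 16, \]

where the last inequality follows since $\max_{0 \le x \le 1} |\sqrt{x}\ln x| < 1$. ■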

The next two steps are useful to approximate the infinite-dimensional continuous
representation by a finite-dimensional discrete representation by appropriately truncating and
quantizing the integral.

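Lemma 10 (Truncation) For $t \ge \ln(4/\epsilon)$,

\[ f_J(x, y) \ge \int_{-t}^{t} h(x, y, \omega)\kappa(\omega)\,d\omega \ge f_J(x, y) - \epsilon. \]

Proof The first inequality follows since $h(x, y, \omega) \ge 0$. For the second inequality, we use
$h(x, y, \omega) \le 2$:

\[ \int_{-\infty}^{-t} h(x, y, \omega)\kappa(\omega)\,d\omega + \int_{t}^{\infty} h(x, y, \omega)\kappa(\omega)\,d\omega \le 4\int_{t}^{\infty} \kappa(\omega)\,d\omega \le 4\int_{t}^{\infty} \frac{4e^{-\pi\omega}}{\ln 4}\,d\omega < 4e^{-t} \le \epsilon, \]

where the last line follows if $t \ge \ln(4/\epsilon)$. ■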


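Define $\omega_i = \epsilon i/16$ for $i \in \{\ldots, -2, -1, 0, 1, 2, \ldots\}$ and $\tilde{h}(x, y, \omega) = h(x, y, \omega_i)$ where $i =
\max\{j \mid \omega_j \le \omega\}$.

Lemma 11 (Quantization) For any $a, b$,

\[ \int_a^b \tilde{h}(x, y, \omega)\kappa(\omega)\,d\omega = \int_a^b h(x, y, \omega)\kappa(\omega)\,d\omega \pm \epsilon. \]

Proof First note that

\[ \left| \tilde{h}(x, y, \omega) - h(x, y, \omega) \right| \le \left(\frac{\epsilon}{16}\right) \cdot \max_{x, y \in [0,1],\,\omega} \left| \frac{\partial h(x, y, \omega)}{\partial \omega} \right| \le \epsilon. \]

Hence,

\[ \left| \int_a^b \tilde{h}(x, y, \omega)\kappa(\omega)\,d\omega - \int_a^b h(x, y, \omega)\kappa(\omega)\,d\omega \right| \le \int_a^b \epsilon\,\kappa(\omega)\,d\omega \le \epsilon. \]
■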


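Given a real number $z$, define vectors $v^z$ and $u^z$ indexed by $i \in \{-i^*, \ldots, -2, -1, 0, 1, 2, \ldots, i^*\}$,
where $i^* = \lceil 16\epsilon^{-1}\ln(4/\epsilon) \rceil$, by:

\[ v^z_i = \sqrt{z}\cos(\omega_i \ln z)\sqrt{\int_{\omega_i}^{\omega_{i+1}} \kappa(\omega)\,d\omega}, \qquad u^z_i = \sqrt{z}\sin(\omega_i \ln z)\sqrt{\int_{\omega_i}^{\omega_{i+1}} \kappa(\omega)\,d\omega}, \]

and note that

\[ (v^x_i - v^y_i)^2 + (u^x_i - u^y_i)^2 = h(x, y, \omega_i)\int_{\omega_i}^{\omega_{i+1}} \kappa(\omega)\,d\omega. \]

Therefore,

\[ \|v^x - v^y\|_2^2 + \|u^x - u^y\|_2^2 = \int_{\omega_{-i^*}}^{\omega_{i^*+1}} \tilde{h}(x, y, \omega)\kappa(\omega)\,d\omega \]
\[ \qquad = \int_{\omega_{-i^*}}^{\omega_{i^*+1}} h(x, y, \omega)\kappa(\omega)\,d\omega \pm \epsilon \]
\[ \qquad = \int_{-\infty}^{\infty} h(x, y, \omega)\kappa(\omega)\,d\omega \pm 2\epsilon = f_J(x, y) \pm 2\epsilon, \]

where the second to last line follows from Lemma 11 and the last line follows from Lemma
10 since $\min(|\omega_{-i^*}|, \omega_{i^*+1}) \ge \ln(4/\epsilon)$.

Define the vector $a^p$ to be the vector generated by concatenating $v^{p_i}$ and $u^{p_i}$ for $i \in [d]$.
Then it follows that

\[ \|a^p - a^q\|_2^2 = \mathrm{JS}(p, q) \pm 2\epsilon d. \]

Hence we have reduced the problem of estimating $\mathrm{JS}(p, q)$ to $\ell_2$ estimation. Rescaling
$\epsilon \leftarrow \epsilon/(2d)$ ensures the additive error is $\epsilon$ while the length of the vectors $a^p$ and $a^q$ is
$O\left(\frac{d^2}{\epsilon}\log\frac{d}{\epsilon}\right)$.

Theorem 12 Algorithm 1 embeds a set $P$ of points under JS into $O\left(\frac{d^2}{\epsilon}\log\frac{d}{\epsilon}\right)$ dimensions
under $\ell_2^2$ with $\epsilon$ additive error, independent of the size of $|P|$.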
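The quantize-and-truncate construction can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the per-cell mass of $\kappa$ is approximated with Simpson's rule instead of being integrated exactly, the logarithm in JS is taken base 2, zero coordinates are mapped to the zero vector, and the un-rescaled grid $\omega_i = \epsilon i/16$ is used so the guarantee checked is the $\pm 2\epsilon d$ bound. All names are hypothetical.

```python
import math

LN4 = math.log(4.0)

def kappa(w):
    """JS kernel density: 2 sech(pi w) / ((ln 4)(1 + 4 w^2))."""
    return 2.0 / (math.cosh(math.pi * w) * LN4 * (1.0 + 4.0 * w * w))

def grid_and_weights(eps):
    """Grid w_i = eps * i / 16 for |i| <= i*, with sqrt of the kappa-mass of each cell."""
    i_star = math.ceil(16.0 / eps * math.log(4.0 / eps))
    step = eps / 16.0
    grid, roots = [], []
    for i in range(-i_star, i_star + 1):
        w = i * step
        # Simpson's rule on [w, w + step]; an assumption standing in for the exact integral.
        mass = step / 6.0 * (kappa(w) + 4.0 * kappa(w + 0.5 * step) + kappa(w + step))
        grid.append(w)
        roots.append(math.sqrt(mass))
    return grid, roots

def embed_js(p, eps):
    """Concatenate the cosine parts (a) and sine parts (b) over all coordinates of p."""
    grid, roots = grid_and_weights(eps)
    a, b = [], []
    for pi in p:
        if pi == 0.0:
            # sqrt(0) kills both terms, so the block is all zeros.
            a.extend([0.0] * len(grid))
            b.extend([0.0] * len(grid))
            continue
        r, lg = math.sqrt(pi), math.log(pi)
        for w, c in zip(grid, roots):
            a.append(r * math.cos(w * lg) * c)
            b.append(r * math.sin(w * lg) * c)
    return a + b

def sq_dist(u, v):
    return sum((s - t) ** 2 for s, t in zip(u, v))

def js(p, q):
    """JS divergence with log base 2 (assumption), using 0 log 0 = 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        for z in (pi, qi):
            if z > 0.0:
                total += z * math.log2(2.0 * z / (pi + qi))
    return total

eps = 0.05
p, q = [0.2, 0.3, 0.5], [0.5, 0.4, 0.1]
estimate = sq_dist(embed_js(p, eps), embed_js(q, eps))
exact = js(p, q)
```

Because each coordinate is processed independently and the grid is fixed in advance, the same computation can be carried out as coordinates arrive one at a time, which is the property the aggregate streaming result relies on.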
Note that using the JL-Lemma, the dimensionality of the target space can be reduced to
$O\left(\frac{\log |P|}{\epsilon^2}\right)$. Theorem 12 along with the AMS sketch of Alon et al. (1996), and the standard
assumption of polynomial precision, immediately implies:

Corollary 13 There is an algorithm that works in the aggregate streaming model to
approximate JS to within a $(1+\epsilon)$-multiplicative factor using $O\left(\frac{1}{\epsilon^2}\log\frac{1}{\epsilon}\log d\right)$ space.

As noted earlier, this is the first algorithm in the aggregate streaming model to obtain
a $(1+\epsilon)$-multiplicative approximation to JS, which contrasts against linear space lower
bounds for the same problem in the update streaming model.


4.2. Randomized embedding 

In this section we show how to embed $n$ points of $\Delta_d$ under JS into $\ell_2^2$ of dimension $\bar{d}$
with $(1+\epsilon)$ distortion, where $\bar{d} = O(n^2 d^3 \epsilon^{-2})$.²

For fixed $x, y \in [0, 1]$, we first consider the random variable $T$ where $T$ takes the value
$h(x, y, \omega)$ with probability $\kappa(\omega)$. (Recall that $\kappa(\cdot)$ is a distribution.) We compute the first
and second moments of $T$.

Theorem 14 $E[T] = f_J(x, y)$ and $\mathrm{var}[T] \le 36(f_J(x, y))^2$.

Proof The expectation follows immediately from the definition:

\[ E[T] = \int_{-\infty}^{\infty} h(x, y, \omega)\kappa(\omega)\,d\omega = f_J(x, y). \]

To bound the variance it will be useful to define the function $f_H(x, y) = (\sqrt{x} - \sqrt{y})^2$,
corresponding to the one-dimensional Hellinger distance, which is related to $f_J(x, y)$ as follows.
We now state two claims regarding $f_H(x, y)$ and $f_J(x, y)$:

Claim 4.1 For all $x, y \in [0, 1]$, $f_H(x, y) \le 2f_J(x, y)$.

Proof Let $f_\chi(x, y) = \frac{(x - y)^2}{x + y}$ correspond to the one-dimensional $\chi^2$ distance. Then, we
have

\[ \frac{f_\chi(x, y)}{f_H(x, y)} = \frac{(x - y)^2}{(x + y)(\sqrt{x} - \sqrt{y})^2} = \frac{(\sqrt{x} + \sqrt{y})^2}{x + y} = \frac{x + y + 2\sqrt{xy}}{x + y} \ge 1. \]

This shows that $f_H(x, y) \le f_\chi(x, y)$. To show $f_\chi(x, y) \le 2f_J(x, y)$ we refer the reader
to (Topsøe, 2001, Section 3). Combining these two relationships gives us our claim. ■

We then bound $h(x, y, \omega)$ in terms of $f_H(x, y)$ as follows.

Claim 4.2 For all $x, y \in [0, 1]$, $\omega \in \mathbb{R}$, $h(x, y, \omega) \le f_H(x, y)(1 + 2|\omega|)^2$.

2. If we ignore precision constraints on sampling from a continuous distribution in a streaming algorithm,
then this also would yield a sketching bound of $O(d^3\epsilon^{-2})$ for a $(1+\epsilon)$ multiplicative approximation.
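Proof Without loss of generality, assume $x \ge y$.

\[ \sqrt{h(x, y, \omega)} = \left| \sqrt{x}\,e^{i\omega \ln x} - \sqrt{y}\,e^{i\omega \ln y} \right| \]
\[ \le \left| \sqrt{x}\,e^{i\omega \ln x} - \sqrt{y}\,e^{i\omega \ln x} \right| + \left| \sqrt{y}\,e^{i\omega \ln x} - \sqrt{y}\,e^{i\omega \ln y} \right| \]
\[ = \left| \sqrt{x} - \sqrt{y} \right| + \sqrt{y} \cdot \left| e^{i\omega \ln x} - e^{i\omega \ln y} \right| \]
\[ = \left| \sqrt{x} - \sqrt{y} \right| + \sqrt{y} \cdot 2 \cdot \left| \sin(\omega \ln(x/y)/2) \right| \]
\[ \le \sqrt{f_H(x, y)} + \sqrt{y} \cdot 2 \cdot \left| \omega \ln\left(\sqrt{x/y}\right) \right| \]
\[ \le \sqrt{f_H(x, y)} + \sqrt{y} \cdot 2 \cdot \left| \sqrt{x/y} - 1 \right| \cdot |\omega| \]
\[ = \sqrt{f_H(x, y)} + 2\sqrt{f_H(x, y)} \cdot |\omega|, \]

and hence $h(x, y, \omega) \le f_H(x, y)(1 + 2|\omega|)^2$ as required. ■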

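These claims allow us to bound the variance:

\[ \mathrm{var}[T] \le E[T^2] = \int_{-\infty}^{\infty} (h(x, y, \omega))^2\kappa(\omega)\,d\omega \le f_H(x, y)^2 \int_{-\infty}^{\infty} (1 + 2|\omega|)^4\kappa(\omega)\,d\omega \]
\[ \qquad = f_H(x, y)^2 \cdot 8.94 \le 36 f_J(x, y)^2. \]
■

This naturally gives rise to the following algorithm.

Algorithm 2: Embeds point $p \in \Delta_d$ under JS into $\ell_2^2$.
Input: $p = \{p_1, \ldots, p_d\}$.
Output: A vector $c^p$ of length $O(n^2 d^3 \epsilon^{-2})$
$\ell \leftarrow 1$; $s \leftarrow \lceil 36 n^2 d^2 \epsilon^{-2} \rceil$
for $j \leftarrow 1$ to $s$ do
    $\omega_j \leftarrow$ a draw from $\kappa(\omega)$;
end
for $i \leftarrow 1$ to $d$ do
    for $j \leftarrow 1$ to $s$ do
        $a^p_\ell \leftarrow \sqrt{p_i}\cos(\omega_j \ln p_i)/\sqrt{s}$
        $b^p_\ell \leftarrow \sqrt{p_i}\sin(\omega_j \ln p_i)/\sqrt{s}$
        $\ell \leftarrow \ell + 1$
    end
end
return $a^p$ concatenated with $b^p$.

Let $\omega_1, \ldots, \omega_t$ be $t$ independent samples chosen according to $\kappa(\omega)$. For any distribution $p$ on $[d]$,
define vectors $v^p, u^p \in \mathbb{R}^{td}$ where, for $i \in [d]$, $j \in [t]$,

\[ v^p_{i,j} = \sqrt{p_i} \cdot \cos(\omega_j \ln p_i)/\sqrt{t}, \qquad u^p_{i,j} = \sqrt{p_i} \cdot \sin(\omega_j \ln p_i)/\sqrt{t}. \]

The normalization by $\sqrt{t}$ is what makes the squared distance an average of $t$ independent copies of $T$.
Let $v^p_i$ be a concatenation of $v^p_{i,j}$ and $u^p_{i,j}$ over all $j \in [t]$. Then note that $E[\|v^p_i - v^q_i\|_2^2] =
f_J(p_i, q_i)$ and $\mathrm{var}[\|v^p_i - v^q_i\|_2^2] \le 36(f_J(p_i, q_i))^2/t$. Hence, for $t = 36n^2d^2\epsilon^{-2}$, by an application
of the Chebyshev bound,

\[ \Pr\left[\left| \|v^p_i - v^q_i\|_2^2 - f_J(p_i, q_i) \right| \ge \epsilon f_J(p_i, q_i)\right] \le 36\epsilon^{-2}/t = (nd)^{-2}. \qquad (4.1) \]

By an application of the union bound over all pairs of points:

\[ \Pr\left[\exists\, i \in [d],\ p, q \in P : \left| \|v^p_i - v^q_i\|_2^2 - f_J(p_i, q_i) \right| \ge \epsilon f_J(p_i, q_i)\right] \le 1/d. \]

And hence, if $v^p$ is a concatenation of $v^p_i$ over all $i \in [d]$, then with probability at least
$1 - 1/d$ it holds for all $p, q \in P$:

\[ (1 - \epsilon)\mathrm{JS}(p, q) \le \|v^p - v^q\|_2^2 \le (1 + \epsilon)\mathrm{JS}(p, q). \]

The final length of the vectors is then $td = 36n^2d^3\epsilon^{-2}$ for approximately preserving distances
between every pair of points with probability at least $1 - 1/d$. This can be reduced further
to $O(\log n/\epsilon^2)$ by simply applying the JL-Lemma.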
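The sampling step can be made concrete as follows. This is a hedged sketch, not the paper's procedure: the paper only says "a draw from $\kappa(\omega)$", so the rejection sampler below (proposal density $\mathrm{sech}(\pi\omega)$ drawn by inverting its CDF, acceptance probability $1/(1+4\omega^2)$) is one assumed way to realize that draw. The features are scaled by $1/\sqrt{s}$ so that the squared distance averages $s$ samples of $h$, the logarithm in JS is taken base 2, and all names are hypothetical.

```python
import math
import random

def sample_sech(rng):
    """Draw from the density sech(pi w) by inverting its CDF (2/pi) atan(e^{pi w})."""
    u = 1.0 - rng.random()  # u lies in (0, 1], which avoids log(0)
    return math.log(math.tan(0.5 * math.pi * u)) / math.pi

def sample_kappa_js(rng):
    """Rejection sampling for kappa(w), proportional to sech(pi w) / (1 + 4 w^2)."""
    while True:
        w = sample_sech(rng)
        if rng.random() < 1.0 / (1.0 + 4.0 * w * w):
            return w

def embed_js_random(p, omegas):
    """Random features sqrt(p_i) cos/sin(w_j ln p_i) / sqrt(s) for all i, j."""
    scale = 1.0 / math.sqrt(len(omegas))
    a, b = [], []
    for pi in p:
        r = math.sqrt(pi)
        lg = math.log(pi) if pi > 0.0 else 0.0  # r = 0 zeroes the block anyway
        for w in omegas:
            a.append(r * math.cos(w * lg) * scale)
            b.append(r * math.sin(w * lg) * scale)
    return a + b

def sq_dist(u, v):
    return sum((s - t) ** 2 for s, t in zip(u, v))

def js(p, q):
    """JS divergence with log base 2 (assumption), using 0 log 0 = 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        for z in (pi, qi):
            if z > 0.0:
                total += z * math.log2(2.0 * z / (pi + qi))
    return total

rng = random.Random(0)
omegas = [sample_kappa_js(rng) for _ in range(20000)]  # shared by every point
p, q = [0.2, 0.3, 0.5], [0.5, 0.4, 0.1]
estimate = sq_dist(embed_js_random(p, omegas), embed_js_random(q, omegas))
exact = js(p, q)
```

The same list `omegas` must be reused for every point in $P$; drawing fresh samples per point would destroy the distance estimate.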


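5. Embedding $\chi^2$ into $\ell_2^2$

We give here two algorithms for embedding the $\chi^2$ divergence into $\ell_2^2$. The computation and
resulting two algorithms are highly analogous to Section 4. First, the explicit formulation
given by Vedaldi and Zisserman (2012) yields that for $x, y \in \mathbb{R}$:

\[ \chi^2(x, y) = \int_{-\infty}^{+\infty} \left\| e^{i\omega \ln x}\sqrt{x\,\mathrm{sech}(\pi\omega)} - e^{i\omega \ln y}\sqrt{y\,\mathrm{sech}(\pi\omega)} \right\|^2 d\omega \]
\[ \phantom{\chi^2(x, y)} = \int_{-\infty}^{+\infty} \left\| \sqrt{x}\,e^{i\omega \ln x} - \sqrt{y}\,e^{i\omega \ln y} \right\|^2 \mathrm{sech}(\pi\omega)\,d\omega. \]

For convenience, we now define:

\[ h(x, y, \omega) = \left\| \sqrt{x}\,e^{i\omega \ln x} - \sqrt{y}\,e^{i\omega \ln y} \right\|^2 = \left(\sqrt{x}\cos(\omega \ln x) - \sqrt{y}\cos(\omega \ln y)\right)^2 + \left(\sqrt{x}\sin(\omega \ln x) - \sqrt{y}\sin(\omega \ln y)\right)^2, \]

and $\kappa_\chi(\omega) = \mathrm{sech}(\pi\omega)$.

We can then write $\chi^2(p, q) = \sum_{i=1}^{d} f_\chi(p_i, q_i)$ where

\[ f_\chi(x, y) = \int_{-\infty}^{\infty} h(x, y, \omega)\kappa_\chi(\omega)\,d\omega = \frac{(x - y)^2}{x + y}. \]

It is easy to verify that $\kappa_\chi(\omega)$ is a distribution, i.e., $\int_{-\infty}^{\infty} \kappa_\chi(\omega)\,d\omega = 1$.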


5.1. Deterministic embedding 

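We will produce an embedding $\phi(p) = (\phi_{p_1}, \ldots, \phi_{p_d})$, where each $\phi_{p_i}$ is an integral that we
discretize appropriately.

Lemma 15 For $0 \le x, y \le 1$, we have $0 \le h(x, y, \omega) \le 2$ and

\[ \left| \frac{\partial h(x, y, \omega)}{\partial \omega} \right| \le 16. \]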


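Similar to Section 4, the next two steps analyze truncating and quantizing the integral.

Algorithm 3: Embed $p \in \Delta_d$ under $\chi^2$ into $\ell_2^2$.
Input: $p = \{p_1, \ldots, p_d\}$ where coordinates are ordered by arrival.
Output: A vector $c^p$ of length $O\left(\frac{d^2}{\epsilon}\log\frac{d}{\epsilon}\right)$
$\ell \leftarrow 1$; $J \leftarrow \lceil \frac{32d}{\epsilon}\ln\left(\frac{6d}{\epsilon}\right) \rceil$;
for $j \leftarrow -J$ to $J$ do
    $\omega_j \leftarrow j \times \epsilon/32d$
end
for $i \leftarrow 1$ to $d$ do
    for $j \leftarrow -J$ to $J - 1$ do
        $a^p_\ell \leftarrow \sqrt{p_i}\cos(\omega_j \ln p_i)\sqrt{\int_{\omega_j}^{\omega_{j+1}} \kappa_\chi(\omega)\,d\omega}$
        $b^p_\ell \leftarrow \sqrt{p_i}\sin(\omega_j \ln p_i)\sqrt{\int_{\omega_j}^{\omega_{j+1}} \kappa_\chi(\omega)\,d\omega}$
        $\ell \leftarrow \ell + 1$
    end
end
return $a^p$ concatenated with $b^p$.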


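Lemma 16 (Truncation) For $t \ge \ln(3/\epsilon)$,

\[ f_\chi(x, y) \ge \int_{-t}^{t} h(x, y, \omega)\kappa_\chi(\omega)\,d\omega \ge f_\chi(x, y) - \epsilon. \]

Proof The first inequality follows since $h(x, y, \omega) \ge 0$. For the second inequality, we use
$h(x, y, \omega) \le 2$:

\[ \int_{-\infty}^{-t} h(x, y, \omega)\kappa_\chi(\omega)\,d\omega + \int_{t}^{\infty} h(x, y, \omega)\kappa_\chi(\omega)\,d\omega \le 4\int_{t}^{\infty} \kappa_\chi(\omega)\,d\omega \le 4\int_{t}^{\infty} 2e^{-\pi\omega}\,d\omega < 3e^{-t} \le \epsilon, \]

where the last line follows if $t \ge \ln(3/\epsilon)$. ■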

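Define $\omega_i = \epsilon i/16$ for $i \in \{\ldots, -2, -1, 0, 1, 2, \ldots\}$ and $\tilde{h}(x, y, \omega) = h(x, y, \omega_i)$ where $i =
\max\{j \mid \omega_j \le \omega\}$. We recall the following lemma from Section 4.

Lemma 17 (Quantization) For any $a, b$,

\[ \int_a^b \tilde{h}(x, y, \omega)\kappa_\chi(\omega)\,d\omega = \int_a^b h(x, y, \omega)\kappa_\chi(\omega)\,d\omega \pm \epsilon. \]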


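Given a real number $z$, define vectors $v^z$ and $u^z$ indexed by $i \in \{-i^*, \ldots, -2, -1, 0, 1, 2, \ldots, i^*\}$,
where $i^* = \lceil 16\epsilon^{-1}\ln(3/\epsilon) \rceil$, by:

\[ v^z_i = \sqrt{z}\cos(\omega_i \ln z)\sqrt{\int_{\omega_i}^{\omega_{i+1}} \kappa_\chi(\omega)\,d\omega}, \qquad u^z_i = \sqrt{z}\sin(\omega_i \ln z)\sqrt{\int_{\omega_i}^{\omega_{i+1}} \kappa_\chi(\omega)\,d\omega}, \]

and note that

\[ (v^x_i - v^y_i)^2 + (u^x_i - u^y_i)^2 = h(x, y, \omega_i)\int_{\omega_i}^{\omega_{i+1}} \kappa_\chi(\omega)\,d\omega. \]

Therefore,

\[ \|v^x - v^y\|_2^2 + \|u^x - u^y\|_2^2 = \int_{\omega_{-i^*}}^{\omega_{i^*+1}} \tilde{h}(x, y, \omega)\kappa_\chi(\omega)\,d\omega \]
\[ \qquad = \int_{\omega_{-i^*}}^{\omega_{i^*+1}} h(x, y, \omega)\kappa_\chi(\omega)\,d\omega \pm \epsilon \]
\[ \qquad = \int_{-\infty}^{\infty} h(x, y, \omega)\kappa_\chi(\omega)\,d\omega \pm 2\epsilon = f_\chi(x, y) \pm 2\epsilon, \]

where the second to last line follows from Lemma 17 and the last line follows from Lemma
16 since $\min(|\omega_{-i^*}|, \omega_{i^*+1}) \ge \ln(3/\epsilon)$.

Define the vector $a^p$ to be the vector generated by concatenating $v^{p_i}$ and $u^{p_i}$ for $i \in [d]$.
Then it follows that

\[ \|a^p - a^q\|_2^2 = \chi^2(p, q) \pm 2\epsilon d. \]

Hence we have reduced the problem of estimating $\chi^2(p, q)$ to $\ell_2$ estimation. Rescaling
$\epsilon \leftarrow \epsilon/(2d)$ ensures the additive error is $\epsilon$ while the length of the vectors $a^p$ and $a^q$ is
$O\left(\frac{d^2}{\epsilon}\log\frac{d}{\epsilon}\right)$.

Theorem 18 Algorithm 3 embeds a set $P$ of points under $\chi^2$ into $O\left(\frac{d^2}{\epsilon}\log\frac{d}{\epsilon}\right)$ dimensions
under $\ell_2^2$ with $\epsilon$ additive error, independent of the size of $|P|$.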
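For the $\chi^2$ kernel the cell masses have a closed form, since $\frac{d}{d\omega}\,\frac{2}{\pi}\arctan(e^{\pi\omega}) = \mathrm{sech}(\pi\omega)$. The sketch below uses that antiderivative to build the vectors $v^z, u^z$ for a single coordinate pair and compares the resulting squared distance with $f_\chi(x, y)$. It is an illustration under that observation, with hypothetical function names, not code from the paper.

```python
import math

def sech_cdf(w):
    """Antiderivative of sech(pi w) normalized as a CDF: (2/pi) atan(e^{pi w})."""
    return 2.0 / math.pi * math.atan(math.exp(math.pi * w))

def coordinate_vectors(z, eps):
    """Vectors v^z and u^z on the grid w_i = eps * i / 16, |i| <= i*, for z > 0."""
    i_star = math.ceil(16.0 / eps * math.log(3.0 / eps))
    step = eps / 16.0
    r, lg = math.sqrt(z), math.log(z)
    v, u = [], []
    for i in range(-i_star, i_star + 1):
        w = i * step
        # Exact kappa_chi-mass of the cell [w, w + step]; clamp guards against rounding.
        mass = max(0.0, sech_cdf(w + step) - sech_cdf(w))
        root = math.sqrt(mass)
        v.append(r * math.cos(w * lg) * root)
        u.append(r * math.sin(w * lg) * root)
    return v, u

def f_chi(x, y):
    """One-dimensional chi-squared term (x - y)^2 / (x + y)."""
    return (x - y) ** 2 / (x + y)

def sq_dist(a, b):
    return sum((s - t) ** 2 for s, t in zip(a, b))

eps = 0.05
x, y = 0.2, 0.7
vx, ux = coordinate_vectors(x, eps)
vy, uy = coordinate_vectors(y, eps)
estimate = sq_dist(vx, vy) + sq_dist(ux, uy)
exact = f_chi(x, y)
```

Concatenating these per-coordinate vectors over $i \in [d]$ gives the vector $a^p$ described above.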

Theorem 18 along with the AMS sketch of Alon et al. (1996), and the standard assumption
of polynomial precision, immediately implies:

Corollary 19 There is an algorithm that works in the aggregate streaming model to
approximate $\chi^2$ to within a $(1+\epsilon)$-multiplicative factor using $O\left(\frac{1}{\epsilon^2}\log\frac{1}{\epsilon}\log d\right)$ space.

5.2. Randomized embedding 

In this section we show how to embed $n$ points of $\Delta_d$ under $\chi^2$ into $\ell_2^2$ of dimension $\bar{d}$
with $(1+\epsilon)$ distortion, where $\bar{d} = O(n^2 d^3 \epsilon^{-2})$.³

For fixed $x, y \in [0, 1]$, we first consider the random variable $T$ where $T$ takes the value
$h(x, y, \omega)$ with probability $\kappa_\chi(\omega)$. (Recall that $\kappa_\chi(\cdot)$ is a distribution.) We compute the
first and second moments of $T$.

Theorem 20 $E[T] = f_\chi(x, y)$ and $\mathrm{var}[T] \le 23(f_\chi(x, y))^2$.

Proof The expectation follows immediately from the definition:

\[ E[T] = \int_{-\infty}^{\infty} h(x, y, \omega)\kappa_\chi(\omega)\,d\omega = f_\chi(x, y). \]

To bound the variance we will again use the function $f_H(x, y) = (\sqrt{x} - \sqrt{y})^2$ corresponding
to the one-dimensional Hellinger distance. We now state two claims relating $f_H(x, y)$ and
$f_\chi(x, y)$:

Claim 5.1 For all $x, y \in [0, 1]$, $f_H(x, y) \le f_\chi(x, y)$.


3. If we ignore precision constraints on sampling from a continuous distribution in a streaming algorithm,
then this also would yield a sketching bound of $O(d^3\epsilon^{-2})$ for a $(1+\epsilon)$ multiplicative approximation.


Abdullah et al. 


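Proof Let $f_\chi(x, y) = \frac{(x - y)^2}{x + y}$ correspond to the one-dimensional $\chi^2$ distance. Then, we
have

\[ \frac{f_\chi(x, y)}{f_H(x, y)} = \frac{(x - y)^2}{(x + y)(\sqrt{x} - \sqrt{y})^2} = \frac{(\sqrt{x} + \sqrt{y})^2}{x + y} = \frac{x + y + 2\sqrt{xy}}{x + y} \ge 1. \]

This shows that $f_H(x, y) \le f_\chi(x, y)$. ■

We then recall Claim 4.2, bounding $h(x, y, \omega)$ in terms of $f_H(x, y)$ as follows.

Claim 5.2 For all $x, y \in [0, 1]$, $\omega \in \mathbb{R}$, $h(x, y, \omega) \le f_H(x, y)(1 + 2|\omega|)^2$.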

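These claims allow us to bound the variance:

\[ \mathrm{var}[T] \le E[T^2] = \int_{-\infty}^{\infty} (h(x, y, \omega))^2\kappa_\chi(\omega)\,d\omega \le f_H(x, y)^2 \int_{-\infty}^{\infty} (1 + 2|\omega|)^4\kappa_\chi(\omega)\,d\omega \]
\[ \qquad = f_H(x, y)^2 \cdot 22.77 \le 23 f_\chi(x, y)^2. \]
■

This naturally gives rise to the following algorithm.

Algorithm 4: Embeds point $p \in \Delta_d$ under $\chi^2$ into $\ell_2^2$.
Input: $p = \{p_1, \ldots, p_d\}$.
Output: A vector $c^p$ of length $O(n^2 d^3 \epsilon^{-2})$
$\ell \leftarrow 1$; $s \leftarrow \lceil 23 n^2 d^2 \epsilon^{-2} \rceil$
for $j \leftarrow 1$ to $s$ do
    $\omega_j \leftarrow$ a draw from $\kappa_\chi(\omega)$;
end
for $i \leftarrow 1$ to $d$ do
    for $j \leftarrow 1$ to $s$ do
        $a^p_\ell \leftarrow \sqrt{p_i}\cos(\omega_j \ln p_i)/\sqrt{s}$
        $b^p_\ell \leftarrow \sqrt{p_i}\sin(\omega_j \ln p_i)/\sqrt{s}$
        $\ell \leftarrow \ell + 1$
    end
end
return $a^p$ concatenated with $b^p$.

Let $\omega_1, \ldots, \omega_t$ be $t$ independent samples chosen according to $\kappa_\chi(\omega)$. For any distribution $p$ on $[d]$,
define vectors $v^p, u^p \in \mathbb{R}^{td}$ where, for $i \in [d]$, $j \in [t]$,

\[ v^p_{i,j} = \sqrt{p_i} \cdot \cos(\omega_j \ln p_i)/\sqrt{t}, \qquad u^p_{i,j} = \sqrt{p_i} \cdot \sin(\omega_j \ln p_i)/\sqrt{t}. \]

The normalization by $\sqrt{t}$ is what makes the squared distance an average of $t$ independent copies of $T$.
Let $v^p_i$ be a concatenation of $v^p_{i,j}$ and $u^p_{i,j}$ over all $j \in [t]$. Then note that $E[\|v^p_i - v^q_i\|_2^2] =
f_\chi(p_i, q_i)$ and $\mathrm{var}[\|v^p_i - v^q_i\|_2^2] \le 23(f_\chi(p_i, q_i))^2/t$. Hence, for $t = 23n^2d^2\epsilon^{-2}$, by an application
of the Chebyshev bound,

\[ \Pr\left[\left| \|v^p_i - v^q_i\|_2^2 - f_\chi(p_i, q_i) \right| \ge \epsilon f_\chi(p_i, q_i)\right] \le 23\epsilon^{-2}/t = (nd)^{-2}. \qquad (5.1) \]

By an application of the union bound over all pairs of points:

\[ \Pr\left[\exists\, i \in [d],\ p, q \in P : \left| \|v^p_i - v^q_i\|_2^2 - f_\chi(p_i, q_i) \right| \ge \epsilon f_\chi(p_i, q_i)\right] \le 1/d. \]

And hence, if $v^p$ is a concatenation of $v^p_i$ over all $i \in [d]$, then with probability at least
$1 - 1/d$,

\[ (1 - \epsilon)\chi^2(p, q) \le \|v^p - v^q\|_2^2 \le (1 + \epsilon)\chi^2(p, q). \]

The final length of the vectors is then $td = 23n^2d^3\epsilon^{-2}$ for approximately preserving distances
between every pair of points with probability at least $1 - 1/d$. This can be reduced further
to $O(\log n/\epsilon^2)$ by simply applying the JL-Lemma.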

6. Dimensionality reduction 

The JL-Lemma has been instrumental for improving the speed and approximation ratios of
learning algorithms. In this section, we give a proof of the JL-analogue for well-behaved
$f$-divergences. Specifically, we show that a set of $n$ points lying on a high-dimensional simplex
can be embedded into a $k = O(\log n/\epsilon^2)$-dimensional simplex, while approximately preserving
the information distances between all pairs of points. This dimension reduction amounts to
reducing the support of the distribution from $d$ to $k$, while approximately maintaining the
divergences.

Our proof uses $\ell_2^2$ as an intermediate space. On a high level, we first embed the points
into a high (but finite) dimensional $\ell_2^2$ space, using the techniques we developed in Section
4.2. We then use the Euclidean JL-Lemma to reduce the dimensionality, and remap the
points into the interior of a simplex. Finally, we show that far away from the simplex
boundaries, the well-behaved $f$-divergences have the same structure as $\ell_2^2$; hence the
embedding back into information spaces can be done with a simple translation and rescaling.
Note that for $f$-divergences that have an embedding into finite-dimensional $\ell_2^2$, the proof is
constructive.

Algorithm 5: DIMENSION REDUCTION FOR $D_f$
Input: Set $P = \{p_1, \ldots, p_n\}$ of points on $\Delta_d$, error parameter $\epsilon$, constant $c_0(\epsilon, f)$
Output: A set $\bar{P}$ of points on $\Delta_{k+1}$ where $k = O\left(\frac{\log n}{\epsilon^2}\right)$

1. Embed $P$ into $\ell_2^2$ to obtain $P_1$ with error parameter $\epsilon/4$.
2. Apply the Euclidean JL-Lemma with error $\epsilon/4$ to obtain $P_2$ in dimension $k = O\left(\frac{\log n}{\epsilon^2}\right)$.
3. Remap $P_2$ to the plane $L = \{x \in \mathbb{R}^{k+1} \mid \sum_i x_i = 0\}$ to obtain $P_3$.
4. Scale $P_3$ to a ball of radius $c_0 \cdot \frac{1}{k+1}$ and center at the centroid of $\Delta_{k+1}$ to obtain $\bar{P}$.


To analyze the above algorithm, we recall the JL-Lemma (Johnson and Lindenstrauss,
1984; Biau et al., 2008):

Lemma 21 For any set $P$ of $n$ points in a (possibly infinite-dimensional) Hilbert space $H$,
there exists a randomized map $f : H \to \mathbb{R}^k$, $k = O\left(\frac{\log n}{\epsilon^2}\right)$, such that $\forall p, q \in P$

\[ (1 - \epsilon)\|p - q\|_2^2 \le \|f(p) - f(q)\|_2^2 \le (1 + \epsilon)\|p - q\|_2^2, \qquad (6.1) \]

with high probability.

We show the following simple corollary: 

Corollary 22 For any set $P$ of $n$ points in $H$ there exists a constant $t$ and a randomized
map $f : H \to \Delta_{k+1}$, $k = O\left(\frac{\log n}{\epsilon^2}\right)$, such that $\forall p, q \in P$:

\[ (1 - \epsilon)\|p - q\|_2^2 \le t\|f(p) - f(q)\|_2^2 \le (1 + \epsilon)\|p - q\|_2^2. \qquad (6.2) \]

Furthermore, for any small enough constant $r$, we may restrict the image of $f$ to be a ball
$B$ of radius $r$ centered at the simplex centroid, $\left(\frac{1}{k+1}, \frac{1}{k+1}, \ldots, \frac{1}{k+1}\right)$.

Proof Consider first the map of Lemma 21 from $H \to \mathbb{R}^k$. Now note that any set of points
in $\mathbb{R}^k$ can be isometrically embedded into the hyperplane $L = \{x \in \mathbb{R}^{k+1} \mid \sum_i x_i = 0\}$. This
follows by remapping the basis vectors of $\mathbb{R}^k$ to those of $L$. Finally, since $L$ is parallel to
the simplex plane, the entire point set may be scaled by some factor $t$ and then translated
to fit in $\Delta_{k+1}$, or indeed in any ball of radius $r$ centered at the simplex centroid. ■
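The remapping in this proof can be written out explicitly. The sketch below is one assumed realization, not the authors' construction: it uses a Helmert-type orthonormal basis for the hyperplane $L$, centers the point set before scaling (a choice made here so that a single scale factor fits the ball), and returns that scale factor `scale`, which corresponds to $t = 1/\texttt{scale}^2$ in (6.2). Function names are hypothetical.

```python
import math

def hyperplane_basis(k):
    """Orthonormal basis of L = {x in R^{k+1} : sum_i x_i = 0} (Helmert-type vectors)."""
    basis = []
    for j in range(1, k + 1):
        norm = math.sqrt(j * (j + 1))
        # j entries equal to 1/norm, one entry -j/norm, then zeros: sums to 0, unit length.
        basis.append([1.0 / norm] * j + [-j / norm] + [0.0] * (k - j))
    return basis

def to_simplex(points, radius):
    """Map points of R^k isometrically into L, then scale and translate them into
    a ball of the given radius around the centroid of the simplex in R^{k+1}."""
    k = len(points[0])
    basis = hyperplane_basis(k)
    mean = [sum(pt[c] for pt in points) / len(points) for c in range(k)]
    centered = [[pt[c] - mean[c] for c in range(k)] for pt in points]
    spread = max(math.sqrt(sum(v * v for v in pt)) for pt in centered)
    scale = radius / spread if spread > 0.0 else 1.0
    out = []
    for pt in centered:
        coords = [1.0 / (k + 1)] * (k + 1)  # start at the simplex centroid
        for coeff, vec in zip(pt, basis):
            for c in range(k + 1):
                coords[c] += scale * coeff * vec[c]
        out.append(coords)
    return out, scale

def dist(a, b):
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(a, b)))

points = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
k = len(points[0])
mapped, scale = to_simplex(points, 0.5 / (k + 1))
```

Any radius below $1/(k+1)$ keeps every coordinate strictly positive, so the mapped points are valid distributions on $k+1$ outcomes, and all pairwise Euclidean distances are multiplied by the same factor `scale`.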

We now show that any well-behaved $f$-divergence is nearly Euclidean near the simplex
centroid.

Lemma 23 Consider any well-behaved $f$-divergence $D_f$, and let $B_r$ be a ball of radius $r$
such that $B_r \subset \Delta_k$ and $B_r$ is centered at the simplex centroid. Then for any fixed $0 < \epsilon < 1$,
there exists a choice of $r$ and scaling factor $t$ (both dependent on $k$) such that $\forall p, q \in B_r$:

\[ (1 - \epsilon)\|p - q\|_2^2 \le t D_f(p, q) \le (1 + \epsilon)\|p - q\|_2^2. \qquad (6.3) \]

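Proof We consider arbitrary $p, q \in B_r$ and note that the assumptions imply each coordinate
lies in the interval $I = \left[\frac{1}{k} - r, \frac{1}{k} + r\right]$. Let $rk = \epsilon'$; then $I = \left[\frac{1 - \epsilon'}{k}, \frac{1 + \epsilon'}{k}\right]$. We now prove
the lemma for scalars $p, q \in I$; the main result follows by considering $D_f$ and $\|\cdot\|_2^2$ coordinate by
coordinate.

By the definition of well-behaved $f$-divergences and Taylor's theorem, there exists a
neighborhood $N$ of 1 and a function $\phi$ with $\lim_{x \to 1} \phi(x) = 0$ such that for all $x \in N$:

\[ f(x) = f(1) + (x - 1)f'(1) + \frac{(x - 1)^2}{2}f''(1) + (x - 1)^2\phi(x) = \frac{(x - 1)^2}{2}f''(1) + (x - 1)^2\phi(x). \]

Therefore:

\[ \frac{D_f(p, q)}{\|p - q\|_2^2} = \frac{p \cdot f\left(\frac{q}{p}\right)}{(p - q)^2} = \frac{p\left(\left(\frac{q - p}{p}\right)^2\frac{f''(1)}{2} + \left(\frac{q - p}{p}\right)^2\phi\left(\frac{q}{p}\right)\right)}{(q - p)^2} = \frac{1}{p}\left(\frac{f''(1)}{2} + \phi\left(\frac{q}{p}\right)\right). \qquad (6.4) \]

Recall again that $p \in I$, so the first term converges to the constant $\frac{k}{2}f''(1)$ as $r$
grows smaller (and hence $\epsilon'$ decreases). Note also that the second term goes to 0 with $r$, i.e.,
given a suitably small choice of $r$ we can make the term smaller than any desired constant.
Hence, for every dimension $k$ and $0 < \epsilon < 1$, there exists a radius of convergence $r$ such
that for all $p, q \in B_r$:

\[ (1 - \epsilon)\|p - q\|_2^2 \le \frac{2}{k f''(1)} D_f(p, q) \le (1 + \epsilon)\|p - q\|_2^2. \qquad (6.5) \]
■

We note that the required value of $r$ can be computed easily for the Hellinger and $\chi^2$
divergence, and that $r$ behaves as $\frac{1}{k} \cdot c$ where $c = c(f, \epsilon)$ is a constant depending only on $\epsilon$
and the function $f$ and not on $k$ or $n$.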
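The near-Euclidean behavior at the centroid is easy to observe numerically for the Hellinger and $\chi^2$ divergences. The sketch below is illustrative and rests on assumptions made here: the generators $f(u) = (\sqrt{u} - 1)^2$ and $f(u) = (u - 1)^2/(u + 1)$, whose second derivatives at 1 are $1/2$ and $1$ respectively, and the second-order Taylor constant $\frac{k}{2} f''(1)$. Function names and the test points are hypothetical.

```python
import math

def hellinger(p, q):
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def chi2(p, q):
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0.0)

def sq_euclid(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def near_centroid(k, rel, signs):
    """Point with coordinates (1 + rel * s_i) / k; signs must sum to zero."""
    return [(1.0 + rel * s) / k for s in signs]

def ratio(div, f2, p, q):
    """D_f(p, q) divided by (k / 2) f''(1) ||p - q||^2; close to 1 near the centroid."""
    k = len(p)
    return div(p, q) / (0.5 * k * f2 * sq_euclid(p, q))

k = 48
signs_p = [1 if i % 2 == 0 else -1 for i in range(k)]         # +, -, +, -, ...
signs_q = [1 if (i // 2) % 2 == 0 else -1 for i in range(k)]  # +, +, -, -, ...
p = near_centroid(k, 0.01, signs_p)
q = near_centroid(k, 0.01, signs_q)

ratio_hellinger = ratio(hellinger, 0.5, p, q)  # f''(1) = 1/2 for f(u) = (sqrt(u) - 1)^2
ratio_chi2 = ratio(chi2, 1.0, p, q)            # f''(1) = 1 for f(u) = (u - 1)^2 / (u + 1)
```

Increasing the relative perturbation `rel` moves the points away from the centroid and lets the ratios drift from 1, which is the effect the radius $r$ in Lemma 23 is chosen to control.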

To conclude the proof, note that the overall distortion is bounded by the combination
of errors due to the initial embedding into $P_1$, the application of the JL-Lemma, and the final
reinterpretation of the points in $\Delta_{k+1}$. The overall error is thus bounded by $(1 + \epsilon/4)^3 \le 1 + \epsilon$.

Theorem 24 Consider a set $P \subset \Delta_d$ of $n$ points under a well-behaved $f$-divergence $D_f$.
Then there exists a $(1 + \epsilon)$ distortion embedding of $P$ into $\Delta_k$ under $D_f$ for some choice
of $k$ bounded as $O\left(\frac{\log n}{\epsilon^2}\right)$. Furthermore, this embedding can be explicitly computed even
for a well-behaved $f$-divergence with an infinite-dimensional kernel, if the kernel can be
approximated in finite dimensions within a multiplicative error, as we show for JS and $\chi^2$.

7. Conclusions 

The embedding and sketching results we show here complement the known impossibility
results for sketching information distances in the strict turnstile model, thus providing a
more complete picture of how these distances can be estimated in a stream. The
dimensionality reduction result essentially says that as long as the information distance admits a
"Euclidean-like" patch somewhere in the simplex, it can be mapped to a lower-dimensional
space. This latter result is a little surprising because the Hellinger distance exhibits more
$\ell_1$-like behavior at the corners of the simplex. In fact, we conjecture that if we limit ourselves
to mappings that are not contractive, then it is likely that the Hellinger distance will not
admit accurate dimensionality reduction.